Abstract

More than a decade ago, a number of methods were proposed for the inference of protein interactions, using whole-genome information from gene clusters, gene fusions and phylogenetic profiles. This structural and evolutionary view of entire genomes has provided a valuable approach for the functional characterization of proteins, especially those without sequence similarity to proteins of known function. Furthermore, this view has raised the real possibility of detecting functional associations of genes and their corresponding proteins for any entire genome sequence. Yet, despite these exciting developments, there have been relatively few cases of real use of these methods outside the computational biology field, as reflected by citation analysis. These methods have the potential to be used in high-throughput experimental settings in functional genomics and proteomics to validate results with very high accuracy and good coverage. In this critical survey, we provide a comprehensive overview of the 30 most prominent examples of single pairwise protein interaction cases in small-scale studies, where protein interactions have either been detected by gene fusion or yielded additional, corroborating evidence from biochemical observations. Our conclusion is that with the derivation of a validated gold-standard corpus and better data integration with big experiments, gene fusion detection can truly become a valuable tool for large-scale experimental biology.
INTRODUCTION

Just over 10 years ago, and prior to the decoding of the first human genome sequence, a set of key computational methods collectively known as 'genome context' methods was developed, heralding a new wave of genome bioinformatics [1]. These methods, exploiting for the first time the structural and evolutionary features of genomic sequences, were shown to be able to accurately infer functional associations of genes and their corresponding protein interactions. The three most highly acclaimed methods of this kind were phylogenetic profiling (based on co-evolutionary patterns across genomes) [2, 3], conserved gene clusters (based on proximal genomic structures) [4-6] and gene fusion detection (also known as the Rosetta Stone method, based on distal genomic elements across species) [7-9], extensively reviewed elsewhere [1].
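
To make the idea behind phylogenetic profiling concrete, the following minimal Python sketch compares presence/absence patterns of genes across genomes. It is an illustration under stated assumptions rather than the procedure of the cited studies [2, 3]: the gene and species names, the use of Jaccard similarity and the 0.8 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two profiles, each a set of genomes containing the gene."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def linked_pairs(profiles, threshold=0.8):
    """Return (gene1, gene2, score) for gene pairs whose profiles reach the threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs

# Hypothetical profiles: the genomes (sp1..sp5) in which each gene has a homolog.
profiles = {
    "geneA": {"sp1", "sp2", "sp4", "sp5"},
    "geneB": {"sp1", "sp2", "sp4", "sp5"},
    "geneC": {"sp2", "sp3"},
}

result = linked_pairs(profiles)
print(result)
```

Genes with identical or near-identical profiles (here geneA and geneB) would be proposed as functionally associated, whereas geneC, which shares only one genome with the others, would not.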

Using gold-standard data sets compiled from the emerging large-scale functional genomics and proteomics experiments, an increasingly wider range of reference genomes and a mixture of variants and parameters [10, 11], these 'genome-aware' sequence analysis methods, and in particular gene cluster/fusion detection, have yielded an impressive level of performance and accuracy [12].
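
As a hedged illustration of how predictions might be scored against a gold-standard set, the sketch below computes precision (the fraction of predictions found in the gold standard) and coverage (the fraction of the gold standard that is recovered). The pair lists are invented for this example and do not represent any actual benchmark; treating every prediction absent from the gold standard as a false positive is also a simplifying assumption.

```python
def precision_and_coverage(predicted, gold):
    """Score unordered protein pairs against a gold-standard set of unordered pairs."""
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    coverage = true_positives / len(gold) if gold else 0.0
    return precision, coverage

# Illustrative data only: three predictions, four gold-standard interactions.
predicted = [("ThiD", "TenA"), ("MsrA", "SelR"), ("geneX", "geneY")]
gold = [("TenA", "ThiD"), ("SelR", "MsrA"), ("MeaB", "MCM"), ("NirK", "NirM")]

precision, coverage = precision_and_coverage(predicted, gold)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

Pairs are stored as unordered sets so that (A, B) and (B, A) count as the same interaction.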

While it is widely appreciated that gene fusion analysis has its roots in the early observations of such events in the first entire genome sequences of cellular organisms ever published, including those of Haemophilus influenzae and Methanococcus jannaschii, the first report of such an explicit prediction has remained rather obscure. This case dates back to 1997, when it was observed that the distal gene pair ThiD (HI0416) and TenA (HI0358) from H. influenzae presented similarity to the 'composite' protein thi-4 from yeast (in this order, N- and C-termini), unlike gene MJ0236 from M. jannaschii [13]; the concluding remarks of that study pointed to the remarkable fact that this functionally associated pair (on the basis of its similarity to thi-4) was not 'proximal' in bacterial genomes, as observed elsewhere [4]. This unique prediction for the interaction of ThiD and TenA was a first step toward the invention of automated methods for protein interaction inference in entire genome sequences, and it has in fact been subsequently confirmed by experimental analysis [14].

Much has followed since, and a number of high-profile reports announced the arrival of new methods such as gene clusters [6], gene fusion [8] and phylogenetic profiles [3]. In particular, gene fusion analysis has provided a basis for the detection of protein interactions in whole genomes [8, 9]. Compared with the other genome-aware methods above, it was shown to be far more reliable with respect to precision (i.e. high-quality predictions with few false positives) [10], albeit with lower coverage, as expected. This method is based on the observation that two separate genes in one genome are found fused into a single gene in another genome (Figure 1).
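
The core observation can be sketched in a few lines of Python. This is a simplified illustration, not the published algorithms [7, 8]: it assumes that sequence similarity hits have already been computed, and the `Hit` record, the `max_overlap` tolerance of 20 residues and all protein names and coordinates are hypothetical. Two query proteins are reported as a candidate pair when they match the same composite protein over regions that do not substantially overlap.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple

class Hit(NamedTuple):
    query: str      # candidate component protein from the query genome
    composite: str  # candidate composite protein from a reference genome
    start: int      # first aligned residue on the composite (1-based)
    end: int        # last aligned residue on the composite (inclusive)

def overlap(a, b):
    """Number of composite residues covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def fusion_pairs(hits, max_overlap=20):
    """Return (N-terminal component, C-terminal component, composite) candidates."""
    by_composite = defaultdict(list)
    for hit in hits:
        by_composite[hit.composite].append(hit)
    pairs = set()
    for composite, group in by_composite.items():
        for a, b in combinations(group, 2):
            if a.query != b.query and overlap(a, b) <= max_overlap:
                first, second = sorted((a, b), key=lambda h: h.start)
                pairs.add((first.query, second.query, composite))
    return sorted(pairs)

# Hypothetical hits on a 1200-residue composite: components of 360 and 450 residues,
# plus a third protein that matches the same region as component1.
hits = [
    Hit("component1", "composite", 1, 360),
    Hit("component2", "composite", 700, 1149),
    Hit("component3", "composite", 20, 340),
]

candidates = fusion_pairs(hits)
print(candidates)
```

Here component1 and component3 are not paired with each other because they cover the same region of the composite, while each of them forms a candidate pair with component2. A practical implementation would need further filters that this sketch omits, for instance checking that the two components are not themselves similar to each other.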

The assumption is that the two separate genes in the first organism tend to be functionally linked [8]. For all the success and the extremely high citation rates of these methods (Table 1), and despite (or possibly because of) the subsequent advancement of experimental proteomics, these methods have not been used extensively in experimental settings and on a large scale. Moreover, within the vast number of publications citing the original genome-aware methods, there exists a disproportionate number of computational biology citations (data not shown). It is somewhat ironic that while these methods were primarily developed to support experimental work and assist the validation of proteomics analyses, large-scale studies apparently did not find much use for these approaches (see below).

Figure 1: A pictorial representation of the gene fusion detection/association inference process. A composite protein (bottom) with two domains exhibits sequence similarities to two component homologs [Component 1 (green) and Component 2 (blue) with 360 and 450 amino acid residues (aa), respectively; not shown]. The total length of the fictitious protein sequence is 1200 residues, drawn to scale; unit shown (120 residues). Networks of associations, with nodes (grey) corresponding to genes/proteins and links (purple) depicting pairwise interactions, can thus include the corresponding (color-coded) component proteins identified by their similarity to composite proteins and inferred to be functionally linked.
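
The networks of associations mentioned in the caption of Figure 1 can be represented as a simple undirected graph. The sketch below is illustrative only: it takes a handful of protein pairs discussed in this survey as example links and groups them into connected modules; the choice of pairs and the helper names are assumptions made for this example.

```python
from collections import defaultdict

def build_network(pairs):
    """Undirected graph: each protein maps to the set of its inferred partners."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected_components(graph):
    """Group proteins into modules linked directly or indirectly by fusion evidence."""
    seen, components = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, members = [node], set()
        while stack:
            current = stack.pop()
            if current in members:
                continue
            members.add(current)
            stack.extend(graph[current] - members)
        seen |= members
        components.append(sorted(members))
    return components

# Example links taken from pairs mentioned in the text (illustrative selection).
pairs = [("ThiD", "TenA"), ("MsrA", "SelR"), ("MeaB", "MCM"), ("MeaB", "ICM")]

modules = connected_components(build_network(pairs))
print(modules)
```

Proteins that share a partner, such as MCM and ICM through MeaB, end up in the same module even though no direct link between them was supplied.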

All three approaches and their variants have collectively received over 6000 citations in the current literature (Table 1), signifying a new era in the analysis of genomic sequences and their real potential for the inference of protein interactions or, more generally, functional associations. Yet, this astonishing number of citation records, almost half of that for the first publication of the human genome sequence, does not exactly correspond to a seamless integration of these methods into experimental pipelines, as indicated by the relatively low number of experiments making direct use of them.

Indeed, a best-practice approach might be the inference of protein interactions followed by validation by experiment or, conversely, the detection of (typically a multitude of) protein interactions subsequently corroborated by computational analysis. In either case, the interplay of a wide range of experimental techniques with the computational detection and inference of these associations can substantially increase both the efficiency and accuracy of large-scale experiments.

Table 1: Citation analysis of key methods (Google Scholar, 20 May 2012)

Method                | Primary reference                | No. of citations
----------------------|----------------------------------|-----------------
Phylogenetic profiles | Ouzounis and Kyrpides (1996) [2] | 54
                      | Pellegrini et al. (1999) [3]     | 1361
Gene order            | Tamames et al. (1997) [4]        | 151
                      | Dandekar et al. (1998) [5]       | 786
                      | Overbeek et al. (1999) [6]       | 896
Gene fusion           | Marcotte et al. (1999) [7]       | 1320
                      | Enright et al. (1999) [8]        | 906
                      | Marcotte et al. (1999) [9]       | 813
Total number of citations (approximately)                | >6000
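
The approximate total in Table 1 can be checked with a few lines of Python. The counts below are transcribed from the table; the value 151 for Tamames et al. is this sketch's reading of a partly illegible entry and should be treated as an assumption.

```python
# Citation counts per primary reference, grouped by method (from Table 1).
citations = {
    "Phylogenetic profiles": [54, 1361],
    "Gene order": [151, 786, 896],   # 151 is an assumed reading of the table entry
    "Gene fusion": [1320, 906, 813],
}

per_method = {method: sum(counts) for method, counts in citations.items()}
total = sum(per_method.values())

for method, count in per_method.items():
    print(f"{method}: {count} ({count / total:.0%})")
print("Total:", total)
```

Under this reading, the sum is consistent with the figure of more than 6000 citations quoted in the text, with the gene fusion references accounting for the largest share.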

In this critical survey, our intention is to demonstrate this best-practice approach for individual studies of protein interactions using gene fusion and propose how the particular method, or more generally all genome-aware approaches, can be put to good use for large-scale proteomics. We review a heterogeneous, scattered body of knowledge in the literature where such benefits have been reported with the successful detection of protein interactions using a mixture of experiment and computation. We provide an assorted list of experimental findings of validated protein interactions, with the intention to reassure potential users of the merits of gene fusion analysis in this context and underline the need for integration of advanced sequence analysis with mainstream proteomics [15].

We thus argue that gene fusion detection and, more generally, genome-aware sequence analysis, following a decade of active development, might be ripe for use in real-world experimental settings on a large scale, as reflected in today's big biology.

EVIDENCE FOR THE INFERENCE 
OF FUNCTIONAL ASSOCIATIONS 
VIA GENE FUSION DETECTION 

Here, we provide strong evidence in support of the method in 30 case studies (Table 2) which cite the original publications [7, 8] and refer mostly to experimental rather than computational work. It is not always clearly reported whether there was a direct use of this particular method, yet it is important to review the valuable experimental evidence in support of gene fusion detection. It is encouraging to see comprehensive reviews where experimental information is summarized hand-in-hand with computational evidence, thus expanding our understanding of functional properties of certain protein classes, e.g. glutaredoxins [16] and their specificities [17]. This integration can provide a more profound characterization of entire cellular processes with the additional element of the ever increasing availability of entire genome sequences [18, 19].

Implicit use of gene fusion analysis in 
wider studies 

Before the detailed description of case studies where gene fusion has been used explicitly either as a guiding principle or as confirmatory evidence, it is worth mentioning a number of analyses which use this approach indirectly. These reports range from comparative studies of entire gene families or classes and their evolutionary history, to functional studies of cellular modules. An example of a comparative study is represented by extensive structure-function analyses of ribulose-1,5-bisphosphate (RuBP) carboxylase/oxygenase (RubisCO)-like proteins [20, 21]. Examples of detailed functional studies are illustrated by the quest for putative cancer biomarker associations for proteins Ki67 [22] and Bcl-xL [23, 24], both detected in breast cancer.

Structure-based screens of interactions for specific molecular partners have been devised to accelerate protein interaction discovery, indirectly based on the premise that functional specificity of potential partners is also reflected by their phylogeny. One such example is the analysis of interactions between histidine-containing protein (HPr) and carbon catabolite protein A (CcpA) in Bacillus subtilis [25]. Other cases include the fusion of HisH/F, two histidine biosynthesis enzymes, predicted to interact through structural analysis [26] and the plant PHYLLO locus for vitamin biosynthesis, present in photosynthetic cyanobacteria as a homologous gene cluster of the men (F/D/C/H) genes [27].

Explicit use of gene fusion analysis: from 
computation to experiment 

Here, we discuss cases of potential protein interactions that have been detected through initial inference by computation, which guided detailed experimentation and, where possible, have been validated biochemically (Table 2). We number all cases from 01 to 30 (marked in bold), in a sequential manner and across different approaches for easy reference.

Table 2: The 30 cases of protein interaction evidence from gene fusion events

Protein pair             | Year | Ref. | Comment                                  | Case | Composite GI
-------------------------|------|------|------------------------------------------|------|----------------------
Peroxidase/FAD-oxidase   | 2000 | [28] | Analysis of composite, histology         | 01   | 20149640
MOCS1A/B                 | 2000 | [30] | Possible fusion, bicistronic gene        | 03   | 3559907
Nit/Fhit                 | 2000 | [48] | Sequence/structure determination         | 13   | 9955180
UEV1/Kua                 | 2000 | [72] | Differential hybrid expression           | 29   | 64 4 8867 (220675525)
AKINβγ/AKIN11            | 2001 | [46] | Complex biochemistry/genetics            | 11   | 18390971
wxcM composite           | 2001 | [57] | Biochemical characterization             | 18   | 14090396
RAD30/CTF7               | 2001 | [66] | Indirect evidence, confirmed in [67]     | 24   | 7678718
Fab-G/-A/SCP2-like       | 2001 | [71] | Multi-functional association             | 28   | 486419
MsrA/SelR                | 2002 | [41] | Biochemical characterization             | 08   | 3252888
PA1957/1958              | 2002 | [61] | Biochemical/genetic experiments          | 21   | 730107
4E-BP3/MASK              | 2003 | [29] | Putative interaction                     | 02   | 27451489
EPXH2 composite          | 2003 | [33] | Functional analysis of two domains       | 04   | 181395
Allene oxide synthase    | 2003 | [55] | EPR spectroscopy analysis                | 16   | 23396450
MsPpm1/2 (MtPpm1)        | 2003 | [56] | Two-hybrid system in vivo interaction    | 17   | 15609188
MMAA (MeaB)/MCM-ICM      | 2004 | [52] | Biochemical evidence for complex         | 14   | 581476
BCS1 (TarI/TarJ)         | 2004 | [60] | Complex formation and catalysis          | 20   | 471234
IspD/F (+IspE)           | 2004 | [68] | Structural analysis and fusion detection | 25   | 12230305
burs-α/β                 | 2005 | [45] | Possible heterodimer activator           | 10   | 62529362
PitA (cld/monooxygenase) | 2006 | [35] | Putative interaction, biochemistry       | 05   | 292656006
SYNW2462/2463            | 2006 | [44] | Supported by expression data             | 09   | 36955582
CysN/CysC (NodQ)         | 2006 | [62] | Interpretation of structure/function     | 22   | 46313
Monooxygenase/trHb       | 2007 | [38] | Structural indications                   | 06   | 29606967
Bh0493/mannitol dh       | 2008 | [58] | Prediction for composite case            | 19   | 348670788
NirK/NirM                | 2009 | [69] | Protein structure complex                | 26   | 34497462
RJL/DnaJ                 | 2009 | [73] | Evolutionary analysis                    | 30   | 23821015
MeaB/ICM                 | 2010 | [54] | Indirect evidence of association         | 15   | 91781568
GfcC/GfcD                | 2011 | [40] | Precise prediction, structure            | 07   | 257140810
NodGS-like FluG          | 2011 | [47] | Nod/GS-like FluG, various techniques     | 12   | 67537298
TagF/PppA                | 2011 | [64] | Confirmatory experimental evidence       | 23   | 358005017
Cass2 (MarA/Rob)         | 2011 | [70] | Structure determination of Cass2         | 27   | 225734311

Protein pair, names of genes and proteins involved in gene fusion (see text); where possible, the name of the composite protein is provided. Year, year of publication. Ref., reference. Comment, short comment on the special features of each case; for more information please see text and original reference. Case, number as in text. Composite GI, NCBI gene identification number for the composite protein sequence, either the most relevant protein or a representative of a wider case. The full composite sequence collection is available at the following publicly accessible URL: http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/IRWJxAcY5x5tj-gzaTirhhG/. In total, 31 GI numbers are provided, including a double count for Case 29. Table entries are sorted by chronological order and (within each year) by order of citation in main text. Please note that not all cases are fully annotated in their corresponding sequence database records; for reasons of symmetry, database cross-references, e.g. from CDD [74] or PFam [75], are thus not provided; these links can be extracted from the corresponding records through the composite GI (reference).
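
As a small consistency check on Table 2, the sketch below records the publication year of each numbered case and verifies that cases 01 to 30 each appear exactly once, before tallying the cases per year. The mapping is transcribed from the table for illustration; reading the partly garbled case label in the AKIN row as 11 is an assumption of this sketch.

```python
from collections import Counter

# Case number -> publication year, transcribed from Table 2.
case_year = {
    1: 2000, 3: 2000, 13: 2000, 29: 2000,
    11: 2001, 18: 2001, 24: 2001, 28: 2001,
    8: 2002, 21: 2002,
    2: 2003, 4: 2003, 16: 2003, 17: 2003,
    14: 2004, 20: 2004, 25: 2004,
    10: 2005,
    5: 2006, 9: 2006, 22: 2006,
    6: 2007,
    19: 2008,
    26: 2009, 30: 2009,
    15: 2010,
    7: 2011, 12: 2011, 23: 2011, 27: 2011,
}

# Any case number between 1 and 30 that is absent from the mapping.
missing = sorted(set(range(1, 31)) - set(case_year))
per_year = Counter(case_year.values())

print("missing cases:", missing)
for year in sorted(per_year):
    print(year, per_year[year])
```

The tally shows that the surveyed cases are spread fairly evenly over 2000 to 2011, with four cases each in 2000, 2001, 2003 and 2011.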

Tentative interaction predictions 

01 Early observations of 'fusion' of stand-alone domains have provided confirmatory evidence that their components allude to possible interactions, especially for longer proteins. One such example is the functional characterization of thyroid NADPH oxidases (ThoX1, ThoX2) with an N-terminal peroxidase domain and a C-terminal NADP-/FAD-oxidase domain [28].

02 Intriguingly, rare cases of mammalian genes such as the reported 4E-BP3 (eIF-binding protein)-MASK fusion transcript across different reading frames point to possible associations of the native gene products in similar regulatory pathways, although it has not been possible to confirm this prediction through the literature [29].

03 Another peculiar instance of gene fusion at the transcript level has been presented for the MOCS1A/B pair, the first enzymes in the pathway of molybdopterin biosynthesis [30]. The MOCS1 locus corresponds to the highly conserved bacterial MoaA ortholog; curiously, the last steps in this pathway involve bacterial genes MoeA and MogA, which are reportedly fused in certain eukaryotic gene homologs [31]. The nature of this possible interaction has not yet been elucidated, despite detailed structural knowledge [32].

04 Inspired by such methods, detailed specificity screens have been performed for bifunctional enzymes, such as the human soluble epoxide hydrolase (EPXH2) [33]. While the domains are well delimited as a putative phosphatase (N-terminal) and epoxide hydrolase (C-terminal), their roles have not been understood in detail and were subject to functional analysis: human and mouse enzymes are bifunctional, while plant enzymes reportedly lack the phosphatase domain [33]. We note that two bacterial genes from Bradyrhizobium japonicum USDA110 map to the corresponding mammalian composite protein [gene identification numbers (GI): 27376073 and 27376225], therefore augmenting the argument of interaction (Figure 2). These interesting discoveries are further strengthened by structure simulation and mechanistic interpretations for catalytic activities of the fused complex [34].

05 A more compelling case of a clear prediction 
with experimental support has been provided for a 
family of proteins from halophilic bacteria, where a 
'bifunctional' protein containing an N-terminal 
chlorite dismutase domain (PF06778) and a 
C-terminal monooxygenase domain (PF03992) 
points to the possible interaction of these two 
enzyme families, supported by protein purification, 
limited proteolysis and mass spectrometry [35]. 
Implications for salt tolerance of this interaction 
remain an open issue: it is worth pointing out that 
similar chlorite dismutase enzymes have been found 
in other chemolithotrophic bacteria, indicating an 
ancient origin [36]; the original discovery in halo- 
bacteria has spurred an active area of fascinating re- 
search [37]. 

06 In parallel work, the monooxygenase domain (PF03992) has been found to be associated with a heme-containing protein, known as bacterial globin or trHb, on the basis of two fusion 'composite' proteins [38]. While interaction data were not yet available, this association is further supported by structural evidence from the pair IsdG/I in dimeric formations [39]. We note that IsdG and IsdI are separate genes in Staphylococcus aureus, yet present in consecutive order in the chlorophyta Ostreococcus tauri/lucimarinus (GI:308806403/GI:145348684) and elsewhere (data not shown).

Figure 2: Mapping of two component proteins from Bradyrhizobium japonicum onto the human composite protein EPXH2. GI numbers are provided. Drawn to scale as in Figure 1.

07 Recently, the 3D structure of protein GfcC, 
essential for assembly of group 4 polysaccharide cap- 
sule, has been reported in conjunction with a puta- 
tive interaction potential with GfcD [40]. This 
interaction is proposed based on the observation 
that the pair GfcC/D exhibits similarity to the 'com- 
posite' protein OtnG from Burkholderia species [40]. 
This prediction might be confirmed in the future 
when the high-resolution structure of GfcD is 
obtained. 

In all the above cases, there is credible evidence 
that the domains in question are functionally 
associated and potentially physically interacting. 
However, there is no direct experimental observa- 
tion confirming these precise predictions, as yet. 

Prediction-driven experiments 

08 One of the early discoveries that confirmed the 
prediction power of this method is the observation 
that the proteins peptide methionine sulfoxide re- 
ductase MsrA and Selenoprotein R (SelR) exhibit 
both a similar phylogenetic distribution across mul- 
tiple organisms and patterns of gene clusters or fu- 
sions (see Table 1 in [41]). This observation has led to 
the characterization of SelR as a methionine sulfox- 
ide reductase [41]. Interestingly, the MsrA/B gene 
fusion components were not detected as an interact- 
ing pair in specific systems [42], while later the pro- 
tein structure of a MsrA/B fusion (composite) 
protein provides detailed explanations for the earlier 
negative biochemical findings [43]. 

09 Full-scale studies with explicit use of gene 
fusion detection and pathway inference have also 
appeared, for example the computational derivation 
of a network for nitrogen assimilation in the cyano- 
bacterium Synechococcus (WH8102), using data 
derived from comparative analysis that is confirmed 
by relevant expression studies [44]. A stunning ex- 
ample is the pair SYNW2462/3 which corresponds 
to a composite protein in other strains and is 
down-regulated by ammonium in the expression ex- 
periments [44], thus implicating this pair in a direct 
association. 

10 An interesting computation-driven experimental analysis has been reported for the bursicon gene, a key factor for insect development [45]. The presence of two genes in Drosophila, found as a putative fusion gene in some other insect species, drove a series of elegant experiments that demonstrated how the two highly similar, paralogous genes form a heterodimer which is involved in the activation of the receptor DLGR2 [45].

11 Another case of a bifunctional adaptor-regulator protein, AKINβγ, with a composite structure has been identified in plants, composed of an N-terminal AMP-activated protein kinase (AMPK) β-subunit (KIS domain) and a C-terminal AMPK γ-subunit (SNF4), itself interacting with SNF1-related protein kinases (SnRKs) [46].

12 A strikingly thorough study, employing a range 
of techniques, resulted in the identification of protein 
interaction between the plant N-terminal nodulin/ 
amidohydrolase (Nod) and the C-terminal glutamine 
synthase I (GS) domains, as inferred from the fungal 
composite NodGS-like FluG protein [47]. 

13 The centerpiece methodology of the gene fusion (Rosetta Stone) hypothesis has been adopted for the structural delineation of the Nit-Fhit interactions, known to share a common evolutionary distribution as well as expression profiles [48]. On the basis of the above observations, an extensive sequencing effort has been made to discover more Nit (nitrilase) homologs from species with Fhit (nucleotide-binding) genes [48], to amplify the initial hypothesis of their association. The structure determination of the composite Nit-Fhit from Caenorhabditis elegans (also widely present elsewhere) provides insights into the interaction of the two monomers as well as additional evidence that this hypothesis holds [48], extending beyond this instance [49-51].

14 Yet another phylogenetically inspired experimental analysis involves the McmC gene, present in a number of bacterial genomes as an alleged fusion of methylmalonyl-CoA mutase (MCM) and MeaB (GTP-binding protein) [52]. This peculiar organization of two genes, where the N-terminus of McmC matches the C-terminus of MCM (and vice versa, with the former region corresponding to a putative coenzyme B12-binding site) while the central region of McmC is similar to MeaB, still points to a complex fusion event, implying an interaction of the two component proteins, namely MCM and MeaB [52]. Biochemical assays confirm the expected activities of the two component proteins (including GTPase activity for MeaB), while complex formation has also been established [52]. The structure of both human homologs has been determined, providing further support for this particular interaction [53].



Figure 3: Mapping of the complex domain structure for IcmF in the actinobacterium N. farcinica IFM 10152, GI:54023003, length 1071 residues (aa); orange: cofactor-binding site; green: MCM; blue: ICM (see text for details). Drawn to scale as in Figure 1.



15 Furthermore, in parallel work this particular 
composite case has been identified as a fusion of 
isobutyryl-CoA mutase (ICM) and named appropri- 
ately as IcmF [54], in Nocardia farcinica for instance 
(GI: 54023003) (Figure 3, suggested domain struc- 
ture). This is one of the most challenging examples 
of substrate and cofactor specificity that has not yet 
been fully elucidated. 

16 Mechanistic studies of substrate coupling between lipoxygenase (C-terminal) and catalase (N-terminal) have been inspired by the presence of this domain fusion in coral and other organisms. Coral allene oxide synthase (cAOS, catalase superfamily) and 8R-lipoxygenase in Plexaura homomalla are fused into a composite protein and were subject to a combination of spectroscopic and mutagenesis studies, confirming the genuine functional role of this association [55].

17 Using a two-hybrid system, it has been shown that the protein pair MsPpm1/2 in Mycobacterium smegmatis, encoded as a single operon, corresponds to an interaction pair as reflected by the composite structure of protein MtPpm1 in M. tuberculosis [56], indeed in multiple strains (data not shown). MtPpm1 encodes a poly-prenol-P-Man synthase for the biosynthesis of the cell wall glycolipid lipoarabinomannan, whose N-terminal domain corresponds to a putative membrane anchoring protein and C-terminal catalytic domain to the dolichol-P-Man synthase; the MsPpm1/2 component orthologs have been shown to interact and complement this function through heterodimerization, with MsPpm1 having a synthase activity and MsPpm2 transmembrane segments that stabilize and augment the enzymatic function [56].

18 Finally, a significant example of mixed computation and experiment for pathway inference and experimental validation with molecular genetics is the analysis of lipopolysaccharide biosynthesis in Xanthomonas campestris [57]. In this report, it is found that gene wxcM codes for a bifunctional enzyme, with its N-terminus acting as an acetyltransferase and the C-terminus acting as a putative isomerase. These observations coupled with detailed experimentation led to the proposal that wxcM catalyzes two alternating steps in the biosynthesis of precursor molecules for this pathway [57].

Independent confirmations, twilight zone similarities 

19 Remote homologs are difficult to detect in fusion mode, as stated for the case of amidohydrolase superfamily members [58]. In this case, yet again, certain homologs of the gene product under consideration, namely Bh0493, characterized as uronate isomerase, exhibit strong similarity to 'composite' proteins, e.g. the Phytophthora sojae gene 347522 (GI:348670788), which contain a C-terminal amidohydrolase domain and an N-terminal mannitol dehydrogenase domain.

20 Carrying this argument to the limit, there is also a possibility that one of the two 'component' proteins might be analogous and not homologous and yet confer similar functional properties. The bifunctional composite protein Bcs1 from H. influenzae contains two domains, an IspD-/GlmU-like cytidyltransferase N-terminal domain and a FabG-like reductase C-terminal domain [59], in an arrangement reminiscent of the genes TarI and TarJ in S. aureus. While TarI shares similarity with the H. influenzae protein, TarJ does not; instead, it has been hypothesized that it carries out a similar reaction, later validated by detailed biochemical experiments, also confirming the direct physical interaction of the two subunits TarI/J as a complex in S. aureus [60].

21 Another case of a missing biochemical function involving weak sequence similarities, that of ribosylnicotinamide kinase, has been identified using a mixture of comparative analysis methods including gene fusion [61]. In Escherichia coli (K-12 MG1655), the fused 'composite' protein contains both required enzyme/transport functions, while in Pseudomonas aeruginosa PAO1 it is represented by two neighboring genes, namely, PA1957 (kinase)/PA1958 (transporter, pnuC homolog). In H. influenzae Rd, the transporter domain is encoded by the pnuC gene, thus representing the function of the composite or neighboring genes from E. coli and P. aeruginosa, respectively [61]. Analysis with biochemical and genetic experiments provides strong evidence for the role of these proteins in the corresponding biochemical pathway [61].
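
The three genomic arrangements described in this case (a single fused gene, two neighboring genes, or two distal genes) can be told apart with a simple rule. The sketch below is illustrative only: it assumes each function has been assigned to a gene identified by its ordinal position along the chromosome, the positions for the composite and distal examples are hypothetical, and the locus tag numbers of PA1957 and PA1958 are used as positions purely for convenience.

```python
def arrangement(locus_a, locus_b, max_gap=1):
    """Classify how two functions are encoded, given the ordinal gene position of each."""
    if locus_a == locus_b:
        return "fused"        # both functions reside in one composite gene
    if abs(locus_a - locus_b) <= max_gap:
        return "neighboring"  # adjacent genes, e.g. in a conserved cluster
    return "distal"           # separate genes far apart on the chromosome

# Gene positions for the two functions (kinase, transporter) in each example.
examples = {
    "composite (one gene, both functions)": (100, 100),  # hypothetical position
    "PA1957/PA1958": (1957, 1958),                       # locus tag numbers as positions
    "hypothetical distal pair": (120, 870),              # hypothetical positions
}

labels = {name: arrangement(a, b) for name, (a, b) in examples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Fused and neighboring arrangements in some genomes are what lend support to a functional link between the corresponding distal genes in others.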



Explicit use of gene fusion analysis: from 
experiment to computation 

Here, we discuss cases of experimentally delineated 
potential protein interactions which are further cor- 
roborated by a follow-up comparative analysis of the 
corresponding genes via the detection of relevant 
gene fusions. 

Interpretation in structure/function studies 

22 The structural analysis of Pseudomonas syringae ATP sulfurylase subunit CysN provides a credible explanation for the association of this domain with adenosine 5′-phosphosulfate kinase (CysC) into NodQ in several bacterial species (e.g. Rhizobium meliloti) [62], strongly suggestive of substrate channeling. The CysN/C case has also been explored within an evolutionary context, as a case of a possible horizontal gene transfer (HGT) event followed by gene fusion. It has been proposed that multiple fusion events have occurred independently, where an archaeal or eukaryotic CysN-like gene most similar to the elongation factor-1α gene (EF-1α) was horizontally transferred into a bacterial species, from which secondary HGT events were spawned [63].

23 In P. aeruginosa, protein TagF participates in the transcriptional control of a type VI secretion system, while at the same time synteny analysis revealed its potential association with PppA, a PP2C phosphatase [64]. In certain species, such as Agrobacterium tumefaciens, the two genes are fused into a composite, further corroborating this association, a finding, however, not supported by the particular report [64]. Indeed, the apparent absence of other associated proteins such as Fha1 in Burkholderia thailandensis might be due to undetectable similarities and absence of syntenic involvement in published genome sequences (data not shown). The complex recruitment sequence for the regulation of type VI secretion is another exemplary system where gene fusion might be responsible for the co-expression of critical genes in certain species. Involvement of the PP2C domain in complex configurations has been reported elsewhere [65].

24 Through a series of rigorous experiments examining the involvement of genes CTF18 and CTF4 in Saccharomyces cerevisiae [66], a detailed network of physical and genetic interactions has been established. One of these genes, Eso1, can be found in Schizosaccharomyces pombe as a fusion of two domains, namely, polymerase η (RAD30, cd01702) and Ctf7 (pfam13880), suggesting a possible indirect interaction [66], later confirmed by large-scale co-localization experiments [67]. This particular example is an excellent case of best practice from small-scale high-quality studies coupled with large-scale high-throughput studies, the main theme propounded in this critical survey.

25 An impressive example of detailed biochemical work
involving enzymes from the core isoprenoid precursor
biosynthesis pathway, namely, IspD/E/F from Campylobacter
jejuni, clearly demonstrates the presence of a composite
protein IspD/F, corresponding to two enzymes catalyzing
non-consecutive steps in this process, known to exist as
components in other organisms including E. coli [68]. The
enzyme IspE, catalyzing the intermediate step, is shown
to mediate this interaction in E. coli [68], thus
providing further support for the hypothesis that gene
fusion might provide a selective advantage for substrate
channeling in some species.

26 The structure determination of copper-containing
nitrite reductase (CuNIR, NirK) with its cognate
cytochrome c (NirM) strongly suggests that this
particular arrangement, supported by comparative genomics
evidence for the co-location of these genes in certain
organisms, is indeed a functional complex [69]. The
cytochrome c moiety has thus been proposed to participate
as the electron donor for the function of CuNIR, pointing
to intra-protein heme-to-copper electron transfer, with
component genes NirK and NirM found as fused genes
(NirK/M composite) elsewhere [69].

27 Finally, examples of gene fusion involving
extrachromosomal elements, as indicated by the structure
determination and sequence analysis of the Cass2 integron
gene cassette-associated protein from an environmental
Vibrio cholerae strain (OP4G), correspond to regions of
DNA-binding (helix-turn-helix) motifs [70]. These motifs
are characteristic of MarA and Rob homologs, suggesting
possible complex interactions of the corresponding
monomers elsewhere [70], as well as the critical
significance of gene fusion events in generating protein
sequence diversity and substrate specificity outside the
recipient genomes.

Omics-supported studies for indirect protein 
associations 

28 In the case of human 17-β-hydroxysteroid dehydrogenase
type 4 (17β-HSD type 4), containing three consecutive
domains with direct involvement in the corresponding
catalytic functions, namely, hydroxyacyl-CoA
dehydrogenase (cd05353, FabG-like), enoyl-CoA hydratase
(FabA-like) and SCP-2 sterol transfer domain (cl01225),
there exist highly conserved multi-functional homologs in
various taxa, including yeasts (where they are known as
FOX2) [71]. The strong conservation and the presumed
multiple events of fusion and fission, also involving the
occasional loss (e.g. SCP-2, GI:328711512) or duplication
(e.g. FabG-like, GI:5869811) of single domains, further
suggest a strong association of these individual
functional elements in the pathway [71].

29 In C. elegans and Drosophila melanogaster, Kua and
UEV (a variant E2 ubiquitin-conjugating enzyme) are
expressed independently and are found at different loci.
The human homologs UEV1 and Kua are adjacent to each
other and expressed either as separate transcripts or as
a hybrid transcript, encoding a fused composite protein
[72]. Experimental analysis of cellular localization
indicates that the two variants (i.e. non-hybrid and
hybrid) reach different destinations within the cell [72].

30 Patchy phylogenetic distribution of genes does not
always imply HGT, as shown in the case of the Ras-like
GTPase RJL family of unknown function, where gene loss
has been implicated on a number of occasions involving
taxa without flagellated cells, thus suggesting a role
with the flagellar apparatus. In two cases, RJL members
were fused with an N- or C-terminal DnaJ (Hsp70) domain,
in the Alveolata and Holozoa, respectively [73].

DISCUSSION AND FUTURE 
PROSPECTS 

This comprehensive survey of individual cases of protein
interaction discovery, through computation-driven
experiment or experimentally derived computational
inference, strongly suggests that gene fusion detection
can be a valuable tool for modern, high-throughput
proteomics [1]. The corroborating evidence derived from
this limited, high-quality data set unambiguously
demonstrates that in most, if not all, cases, gene
fusions can direct toward potential protein interactions
with high accuracy and reasonably good coverage. One
prerequisite is the availability of an entire genome
sequence for the species under consideration, a condition
that is increasingly relaxed as more genome sequences
become available. Another prerequisite is evidently
correct gene prediction, so that the domain structure of
encoded proteins is accurately reflected in the sequence,
a condition that is not always easy to satisfy with
next-generation sequencing technology, where sequences
are obtained from short-read assemblies.

As mentioned earlier, it is somewhat ironic that while
the genome-aware methods were developed as a way to
augment experimental work in proteomics, most such
large-scale studies in the literature do not report (or
cite) the use of any of those methods as a validation
mechanism for high-throughput experiments. Indeed, the
majority of citations for these methods (Table 1) arise
from similar computational work, technical extensions,
general reviews and sensational commentary, written in
the past. We hope that we now provide the argument for
more extensive use of gene fusion analysis for
proteomics.

One could envision a setting where this gold-standard
corpus expands to a significant degree and can be used
primarily to assess the coverage of protein interaction
detection by experiments. One example, with the limited
information available today, follows. Of the 30 cases
(Table 2), there are five readily detectable cases of
orthologous gene pairs in the genome of S. cerevisiae
S288c (Table 3). We chose this organism for two reasons:
first, for its extensively studied interactome and,
second, for its consistent and easy-to-use gene name
catalog. Searching for these pairs in the source database
listed [76], it can be found that four out of the five
cases can be detected as interaction partners, indicating
a high coverage, in this instance 80%. Of course, this
estimate is by way of example, as a deeper analysis and
statistical treatment will be necessary in real-world
settings.

Thus, it must be appreciated that with the availability
of an ever increasing number of genomes acting as
reference, i.e. providing composite background protein
sequences, this approach can become a benchmark for
protein interaction research. We have extensively
reviewed the available experimental evidence in the
literature and have found that, while protein interaction
data processing has been maturing over the past few years
[77], the inference of protein interactions has not been
integrated to a sufficient degree, at least as this is
reflected by citation analysis. The development of
multiple methods that compile experimentally derived
protein interactions from curated databases, process the
interaction graphs, cluster related modules, discover
novel associations and visualize them [77, 78] appears to
have outshone valuable genome-aware inference methods.

Table 3: Examples of component pairs detected by gene fusion in the S. cerevisiae interactome

Case  Component 1  Component 2  Found?
08    YER042W      YCL033C      Yes     3252888
11    YER027C      YGL115W      Yes     18390971
13    YJL126W      YDR305C      Yes     9955180
21    YBR118W      YKL001C      Yes     46313
23    YDR419W      YFR027W      No      7678718

Source: http://www.yeastnet.org/data/yeastnet2.orf.txt [76].
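
The coverage estimate given for Table 3 (four of five pairs, i.e. 80%) amounts to a simple lookup of unordered gene pairs in an interaction edge list. The following is a minimal sketch of that lookup, not the procedure actually used in the survey. It assumes that the network file lists one interaction per line with the two ORF names as the first two whitespace-separated fields, which is an assumption about the file layout. The names parse_edges, coverage and TOY_NETWORK are hypothetical, and the toy network is a stand-in that simply mirrors the Found? column of Table 3 rather than real file contents.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = FrozenSet[str]

# Component pairs as listed in Table 3: (case, ORF 1, ORF 2).
TABLE3_PAIRS: List[Tuple[str, str, str]] = [
    ("08", "YER042W", "YCL033C"),
    ("11", "YER027C", "YGL115W"),
    ("13", "YJL126W", "YDR305C"),
    ("21", "YBR118W", "YKL001C"),
    ("23", "YDR419W", "YFR027W"),
]


def parse_edges(lines: Iterable[str]) -> Set[Pair]:
    """Read 'ORF_A ORF_B [extra fields]' lines into a set of unordered pairs.

    The two-leading-fields layout is an assumption about the input file.
    """
    edges: Set[Pair] = set()
    for line in lines:
        fields = line.split()
        # Skip blank lines, short lines and comment lines.
        if len(fields) >= 2 and not fields[0].startswith("#"):
            edges.add(frozenset(fields[:2]))
    return edges


def coverage(
    pairs: List[Tuple[str, str, str]], edges: Set[Pair]
) -> Tuple[float, Dict[str, bool]]:
    """Return the fraction of query pairs present in the network, plus per-case hits."""
    found = {case: frozenset((a, b)) in edges for case, a, b in pairs}
    return sum(found.values()) / len(found), found


# Toy stand-in for the network file (NOT real data): it contains the four
# pairs marked "Yes" in Table 3, in arbitrary orientation, plus one
# placeholder edge with made-up identifiers.
TOY_NETWORK = """\
YER042W YCL033C
YGL115W YER027C
YJL126W YDR305C
YKL001C YBR118W
ORF_X ORF_Y
"""

fraction, found = coverage(TABLE3_PAIRS, parse_edges(TOY_NETWORK.splitlines()))
print(f"{sum(found.values())}/{len(found)} pairs found ({fraction:.0%})")
```

Storing each edge as a frozenset makes the lookup independent of the order in which the two ORFs are listed. With a real network file, the lines of that file would be passed to parse_edges in place of the toy string.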

It is encouraging to see parallel studies that examine
the micro-evolutionary mechanisms of these events in one
genus, e.g. Drosophila [79], and the further
investigation of concurrent gene (i.e. domain) loss
events, for example the repertoire of Myb domains lost in
fungal zuotins from MIDA1-like factors [80]. Given the
wider availability of genomic and metagenomic
information, we predict that gene fusion detection and
subsequent inference of functional associations will
become more common and applicable to large-scale studies
of protein interaction. The best-practice examples
provided herein point to the critical importance of
integrating inference and validation methods for protein
interaction detection, and show how trailblazing
small-scale studies pave the way for large-scale
proteomics.

The sheer power of evolutionary thinking behind protein
interaction analysis [81, 82] can thus reveal the
conservation and diversification of interacting modules,
enriched by functional genomics data (for example, gene
expression or cellular localization), and further our
understanding of the complex pathways that govern cell
biology.



Key Points 

• Gene fusion analysis is one of the most successful computational 
methods for the detection of genome-wide protein interactions. 

• Compared with other methods that take into account genome
structure and evolution, gene fusion has a relatively low coverage
of known interactions but high precision.

• Despite high citation rates, these methods do not appear to have
been used extensively in high-throughput proteomics.

• Many examples from individual case studies listed here have
demonstrated that this method is applicable as a validation
approach for proteomics.

• Evolutionary thinking in support of protein interaction analysis
can reveal the conservation and diversification of interacting
modules in cellular pathways.